February 18, 2017

SmartType Introduction

SmartType uses natural language processing to predict the next word from the words the user has already typed. The algorithm relies mainly on the R package quanteda and performs the following steps:
1. Take a random sample of lines from blogs, news articles, and Twitter posts.
2. Concatenate the samples and make tokens of the text.
3. Use the tokens to make ngrams with 3 elements.
4. Make a table of counts and frequencies of the ngrams.
5. Write the table to local disk as a CSV file and upload it as a database to an Amazon cloud project.
6. SmartType queries the cloud database, finds the ngrams that begin with the last two words typed, and returns the most frequent one as a suggestion to the user.
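
Steps 1 and 2 can be sketched in base R as follows. This is a minimal illustration rather than the original code: the helper sample_lines, the sample sizes, the seed, and the toy vectors are all assumptions, and the real corpora would be read from disk with readLines().

```r
# Hypothetical helper: draw n random lines from a character vector.
sample_lines <- function(lines, n, seed = 1234) {
  set.seed(seed)                          # make the sample reproducible
  n <- min(n, length(lines))              # never ask for more lines than exist
  lines[sort(sample.int(length(lines), n))]
}

# Toy stand-ins for the three corpora (assumed data, not the real files).
blogs   <- paste("blog line", 1:50)
news    <- paste("news line", 1:30)
twitter <- paste("tweet", 1:20)

# Sample each source, then concatenate the samples into one character vector.
sampleAll <- c(
  sample_lines(blogs, 5),
  sample_lines(news, 3),
  sample_lines(twitter, 2)
)
length(sampleAll)
```

Fixing the seed keeps the sample reproducible, so the ngram table can be regenerated from the same lines.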

SmartType Tokenization

The original text files are blogs (210 MB), news articles (206 MB), and Twitter posts (167 MB). The following code tokenizes sampleAll, the text concatenated from random samples of those files:
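library(quanteda)

# Argument names follow the quanteda release available in early 2017;
# later releases renamed remove_hyphens to split_hyphens and dropped remove_twitter.
allTokens <- tokens(
  char_tolower(sampleAll), what="word", remove_numbers=TRUE,
  remove_punct=TRUE, remove_symbols=TRUE, remove_hyphens=TRUE,
  remove_twitter=TRUE, remove_url=TRUE
)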

• Characters are converted to lowercase (char_tolower).
• The text is split into word tokens (what="word").
• Tokens that consist only of numbers, and URLs, are removed (remove_numbers and remove_url).
• Punctuation, symbols, and hyphens are removed (remove_punct, remove_symbols, and remove_hyphens).
• The Twitter characters @ and # are removed (remove_twitter).
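
To make the effect of these options concrete, here is a rough base-R approximation of the same cleaning steps. It is not quanteda's tokenizer: the function simple_tokens and its regular expressions are assumptions for illustration, and it handles only ASCII text.

```r
# Hypothetical base-R approximation of the cleaning steps; not quanteda's tokenizer.
simple_tokens <- function(text) {
  text <- tolower(text)                       # lowercase, like char_tolower()
  text <- gsub("https?://[^ ]+", " ", text)   # drop URLs
  text <- gsub("[^a-z0-9' ]+", " ", text)     # drop punctuation, symbols, hyphens, @ and #
  words <- strsplit(trimws(text), " +")[[1]]  # split on runs of spaces
  words <- words[nzchar(words)]               # discard empty strings
  words[!grepl("^[0-9]+$", words)]            # drop tokens that are only digits
}

simple_tokens("Check http://example.com: 3 well-known #tips from @bob!")
# "check" "well" "known" "tips" "from" "bob"
```

The hyphenated word is split in two and the @ and # prefixes are stripped while the words themselves are kept, which is the behavior the bullets above describe.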

SmartType Ngrams, Counts and Frequencies

Next, the code uses the tokens to make ngrams with 3 elements (tokens_ngrams()). It then builds a document-feature matrix (dfm()) and, from that, a table of ngram counts and frequencies (textstat_frequency()):
allNgrams <- tokens_ngrams(allTokens, n=3)
allDfm <- dfm(allNgrams, tolower=TRUE)
txtFreq <- textstat_frequency(allDfm)
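
For readers without quanteda at hand, the idea behind these three calls can be sketched in base R. The helper make_trigrams, the toy word vector, and the column names are assumptions; the "_" separator matches the default concatenator of tokens_ngrams(), but the output layout is not that of textstat_frequency().

```r
# Hypothetical helper: build 3-word ngrams from a vector of tokens, joined with "_".
make_trigrams <- function(words) {
  n <- length(words)
  if (n < 3) return(character(0))           # too few tokens for a trigram
  paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n], sep = "_")
}

# Toy tokens (assumed data).
words <- c("thanks", "for", "the", "follow", "thanks", "for", "the", "help")

# Count each distinct trigram, most frequent first.
counts <- sort(table(make_trigrams(words)), decreasing = TRUE)

# Table of ngram, count, and relative frequency.
ngramTable <- data.frame(
  ngram = names(counts),
  count = as.integer(counts),
  freq  = as.integer(counts) / sum(counts),
  stringsAsFactors = FALSE
)
ngramTable
```

In this toy input, "thanks_for_the" occurs twice and every other trigram once, so it ends up in the first row.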

The resulting table (285 MB) is saved to local disk for future use. To make it available to the app and keep lookups fast, it is also uploaded to an Amazon cloud project as a database.
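
Saving the table locally could look like the sketch below. The file path, the toy table, and its column names are assumptions, since the source gives neither the real path nor the upload code; the cloud upload is not shown.

```r
# Toy table standing in for txtFreq; the column names are illustrative.
ngramTable <- data.frame(
  ngram = c("thanks_for_the", "for_the_follow"),
  count = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical location; a temporary file is used so the example has no side effects.
csvPath <- tempfile(fileext = ".csv")
write.csv(ngramTable, csvPath, row.names = FALSE)

# Read the file back to confirm the round trip.
restored <- read.csv(csvPath, stringsAsFactors = FALSE)
```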

The app queries the cloud database for ngrams that begin with the last two words entered by the user. The suggestion given to the user is the most frequent matching ngram, whose third word is the predicted next word.
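
The lookup can be illustrated against an in-memory table instead of the cloud database. This is a sketch under assumptions: the function suggest_ngram, the column names, and the toy counts are hypothetical, and the actual database query is not described in the source.

```r
# Hypothetical lookup: return the most frequent ngram that starts with the two given words.
suggest_ngram <- function(ngramTable, word1, word2) {
  prefix <- paste0(word1, "_", word2, "_")
  hits <- ngramTable[startsWith(ngramTable$ngram, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)   # no matching ngram
  hits$ngram[which.max(hits$count)]
}

# Toy table (assumed data).
ngramTable <- data.frame(
  ngram = c("thanks_for_the", "thanks_for_following", "for_the_help"),
  count = c(12L, 7L, 3L),
  stringsAsFactors = FALSE
)

best <- suggest_ngram(ngramTable, "thanks", "for")   # "thanks_for_the"
nextWord <- sub(".*_", "", best)                     # "the"
```

Returning NA when nothing matches lets the caller decide what to show; the source does not say how the app handles that case.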

SmartType App

• User types some words into the input box.
• Clicking Submit with at least two words entered returns a suggestion.
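
The two-word requirement could be checked along these lines. The function last_two_words is a hypothetical helper, not the app's code, and returning NULL for short input is an assumed convention.

```r
# Hypothetical input check: return the last two words typed,
# or NULL if fewer than two words were entered.
last_two_words <- function(input) {
  words <- strsplit(trimws(tolower(input)), " +")[[1]]
  words <- words[nzchar(words)]          # discard empty strings
  if (length(words) < 2) return(NULL)    # not enough words for a lookup
  tail(words, 2)
}

last_two_words("Thanks for")        # "thanks" "for"
last_two_words("See you at the")    # "at" "the"
last_two_words("Hello")             # NULL
```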